 fine-grained visual prompting


Supplementary Materials for "Fine-Grained Visual Prompting" Lingfeng Yang, Yueze Wang

Neural Information Processing Systems

By applying a single blur operation, we can retain more spatial relevance information. Moreover, since the images are blurred, they may have a relatively minor impact on the recognition ability of CLIP on the target.



Fine-Grained Visual Prompting

Neural Information Processing Systems

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background.
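The Blur Reverse Mask idea described above (keep the target sharp, blur everything outside its mask) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names `box_blur` and `blur_reverse_mask` are hypothetical, the image is a small grayscale grid stored as nested lists, and a simple mean (box) filter stands in for whatever blur kernel the paper actually uses. In a real pipeline the mask would come from a segmentation model, and the prompted image would then be passed to CLIP.

```python
def box_blur(image, radius):
    """Mean-filter a 2D grayscale image (list of rows).

    Assumption: a box filter is used as a stand-in for the paper's blur.
    Windows are clipped at the image border.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(image[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def blur_reverse_mask(image, mask, radius=2):
    """Keep pixels inside the mask unchanged; blur pixels outside it.

    `mask[y][x]` is truthy for pixels that belong to the target.
    The whole image is blurred once, then composited with the original.
    """
    blurred = box_blur(image, radius)
    h, w = len(image), len(image[0])
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(w)]
        for y in range(h)
    ]


# Toy example: a bright 2x2 target on a dark 4x4 background.
image = [
    [0, 0, 0, 0],
    [0, 100, 100, 0],
    [0, 100, 100, 0],
    [0, 0, 0, 0],
]
mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, True, True, False],
    [False, False, False, False],
]
prompted = blur_reverse_mask(image, mask, radius=1)
```

Because the background is blurred rather than cropped or blacked out, the target keeps its spatial context, which matches the motivation given in the abstract. The blur is applied once to the whole image and the mask only decides which version of each pixel is kept.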

